The purpose of this notebook is to explore linear regression techniques from the Coursera specialization in Machine Learning.
# Load the required packages
library(tidyverse)
library(here)
library(janitor)
This exploration uses the Perth House Prices dataset from Kaggle.
# Load the data
house_prices <-
  read_csv(here("data/src/all_perth_310121.csv"))
# View the head
head(house_prices)
## # A tibble: 6 × 19
##   ADDRESS           SUBURB  PRICE BEDROOMS BATHROOMS GARAGE LAND_AREA FLOOR_AREA
##   <chr>             <chr>   <dbl>    <dbl>     <dbl> <chr>      <dbl>      <dbl>
## 1 1 Acorn Place     South… 565000        4         2 2            600        160
## 2 1 Addis Way       Wandi  365000        3         2 2            351        139
## 3 1 Ainsley Court   Camil… 287000        3         1 1            719         86
## 4 1 Albert Street   Belle… 255000        2         1 2            651         59
## 5 1 Aman Place      Lockr… 325000        4         1 2            466        131
## 6 1 Amethyst Cresc… Mount… 409000        4         2 1            759        118
## # ℹ 11 more variables: BUILD_YEAR <chr>, CBD_DIST <dbl>, NEAREST_STN <chr>,
## # NEAREST_STN_DIST <dbl>, DATE_SOLD <chr>, POSTCODE <dbl>, LATITUDE <dbl>,
## # LONGITUDE <dbl>, NEAREST_SCH <chr>, NEAREST_SCH_DIST <dbl>,
## # NEAREST_SCH_RANK <dbl>
We will perform the following transformations to produce a clean dataset to work with for this project:
- Clean the column names using the janitor::clean_names() function
- Convert GARAGE and BUILD_YEAR from character to numeric
- Split DATE_SOLD into separate columns for year and month
- Convert POSTCODE from numeric to character
We convert the column names to lowercase to make them easier to work with in code.
house_prices_cln <-
  house_prices |>
  clean_names()
names(house_prices_cln)
##  [1] "address"          "suburb"           "price"            "bedrooms"
##  [5] "bathrooms"        "garage"           "land_area"        "floor_area"
##  [9] "build_year"       "cbd_dist"         "nearest_stn"      "nearest_stn_dist"
## [13] "date_sold"        "postcode"         "latitude"         "longitude"
## [17] "nearest_sch"      "nearest_sch_dist" "nearest_sch_rank"
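To illustrate what clean_names() does beyond lowercasing, here is a minimal sketch on a toy tibble. The column names and values are hypothetical and not taken from the Perth data; with its default settings the function converts names to snake_case.

```r
library(tibble)
library(janitor)

# Hypothetical tibble with inconsistently styled column names
toy <- tibble(
  `LAND_AREA`   = c(600, 351),
  `Nearest Stn` = c("A", "B")
)

# clean_names() returns the same data with snake_case column names
toy_cln <- clean_names(toy)
names(toy_cln)
```

The Perth columns are already upper-case with underscores, so in this notebook the visible effect is only the change to lowercase.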
We correct the data types for some of the fields. Before doing this we check the contents of each column we want to convert.
garage
house_prices_cln |>
  count(garage)
## # A tibble: 26 × 2
##    garage     n
##    <chr>  <int>
##  1 1       5290
##  2 10        26
##  3 11         7
##  4 12        30
##  5 13         8
##  6 14        13
##  7 16         4
##  8 17         1
##  9 18         3
## 10 2      20724
## # ℹ 16 more rows
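Because the printed tibble shows only 10 of the 26 distinct values, a check along the following lines could confirm that no non-numeric strings are hiding in the remaining rows. This is a sketch on toy data, not a step taken from the notebook: the `toy` tibble and its "NULL" entry are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in for house_prices_cln; "NULL" is an invented bad value
toy <- tibble(garage = c("1", "2", "10", "NULL", NA))

# Keep only non-missing values that are not made up entirely of digits
non_numeric <- toy |>
  filter(!is.na(garage), !grepl("^[0-9]+$", garage)) |>
  count(garage)

non_numeric
```

An empty result would mean every non-missing value is a whole number and the column can be converted without producing new missing values.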
The values displayed are all legitimate numbers (the printout shows only the first 10 of the 26 distinct values), so we convert this column to numeric.
# Convert garage from character to numeric
house_prices_cln <-
  house_prices_cln |>
  mutate(garage = as.numeric(garage))
# Check the values again after conversion
house_prices_cln |>
  count(garage)
## # A tibble: 26 × 2
##    garage     n
##     <dbl> <int>
##  1      1  5290
##  2      2 20724
##  3      3  2042
##  4      4  1949
##  5      5   362
##  6      6   466
##  7      7    97
##  8      8   129
##  9      9    17
## 10     10    26
## # ℹ 16 more rows
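The two count() outputs list the same values in a different order because character vectors sort lexicographically while numeric vectors sort by magnitude. A small sketch with made-up values shows the difference, together with one way to confirm that a conversion introduced no new missing values (an extra check, not something the notebook itself performs).

```r
# Made-up values standing in for the garage column
garage_chr <- c("2", "10", "1", "12")

# Character sort compares strings left to right, so "10" precedes "2"
sort(garage_chr)

# Numeric sort orders by magnitude
garage_num <- as.numeric(garage_chr)
sort(garage_num)

# Count missing values added by the conversion (0 means nothing was lost)
new_nas <- sum(is.na(garage_num)) - sum(is.na(garage_chr))
new_nas
```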